Mapping of Sequence Reads to the Reference Genomes ◾ 59
rotations then are sorted alphabetically to form a matrix called BWM (Burrows–Wheeler
Matrix) as shown in the second column of Figure 2.7. The last column of characters (in
red) in the BWM is the BWT of the sequence. As shown in Figure 2.8, the BWT of the
sequence S is “AGG$GGTTCTC”. When an aligner uses BWT, it transforms the entire
genome sequence into a BWT. However, computing a BWT using the above naïve approach
is computationally expensive with O(n2 log n) time complexity. Instead, we can use the
suffix array to compute the BWT of a reference genome in O(n) time complexity. This
strategy is utilized by the aligners that use BWT. Constructing a BWT from a suffix array
is straightforward and very simple. To get the ith character of the BWT of the string S,
whose characters are indexed as i=1, 2, …, n, from the suffix array (A), simply use the
following rule:
BWT(S)[i] = S[A[i] - 1] if A[i] > 1 else S[n]
Figure 2.8 shows the indexes i=1, 2, 3, …,11 (first row) of the sequence S (second row) and
suffix array index A (third row), as shown in Figure 2.7. Now to infer the BWT from A,
let us begin from BWT(S)[11]. By applying the rule, BWT(S)[11] = S[A[11] - 1] = S[10] = A.
Likewise, BWT(S)[10] = S[A[10] - 1] = S[9]= G; BWT(S)[6] = S[A[6] - 1]=S[5]=G; but BWT(S)
[1], A[1] is less than 1; hence, BWT(S)[1] = $. For the BWT(S)[i] corresponding to A=9, 5, 8,
4, 7, 3, 2, we will continue using BWT(S)[i] = S[A[i] - 1]. The BWT is shown in Figure 2.9.
The question that comes up is why we need to transform a sequence of a reference
genome into BWT? The BWT serves two purposes; first, BWT groups the characters of
the sequence so that a single character appears many times in a row because the column
is sorted alphabetically and that can be used for sequence compression, which reduces the
memory storage. In this sense, the BWT is compressible and reversible. The second pur-
pose is that BWT is a data structure that can be used for indexing the sequence of a refer-
ence genome to make finding a position of a read fast.
The property of the BWT is that we can reverse it to obtain the BWM by using the last-
to-first column mapping property or simply as LF mapping, where L is the last column
FIGURE 2.7 Cyclic rotations and sorted rotation.